Text File | 1993-04-09 | 2.5 KB | 68 lines | [TEXT/KEEN] |
- #$FrequencyWord: print sorted list of words, together with number
- #of times each word occurs. Use with any input option,
- #select "Show stdout" since that's where the output goes.
- #This program prints the words in order by frequency of occurrence,
- #whereas $WordFrequency prints them alphabetically.
- #The file "common words" in the "hAWK programs" folder contains
- #a list of words to skip. To do a better job, you can create a
- #custom list - for example, the word "while" can be skipped in
- #ordinary text, but should be included if the text deals with
- #C or hAWK programming. If this file is missing, the program
- #still runs, but common words are counted too (which uses
- #more memory and runs slower).
- #Tech note: the words in “common words” are loaded into the
- #(associative) array “common[]”. As with all arrays in hAWK,
- #elements are stored in a hash table, so fetching an element by
- #its index or testing for an index with the “in” operator is
- #very fast. Thus there would be no real advantage to keeping
- #the common words in alphabetical order. Also, duplicate words
- #cause no problems.
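- #
- #A minimal sketch of the membership test described above, assuming
- #standard AWK semantics (the words used here are illustrative only):
- #    common["the"] = 1
- #    if ("the" in common) print "skip"     #true: the index exists
- #    if ("while" in common) print "skip"   #false: nothing is created
- #    if (common["while"] == 1) print "x"   #also false, but the reference
- #                                          #itself creates common["while"]
- #This is why the main rule tests with “in” rather than by comparing
- #common[$k] to a value: the test never adds entries to the array.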
-
- #This isn't perfect, but is very useful as-is. It's a simple
- #program, one you can tinker with easily - try it out on
- #some small files, and refinements will suggest themselves.
-
- # User’s Manual references:
- # «hAWK User’s Manual» «F Running hAWK programs»
- # «hAWK User’s Manual» «L 5 Regular expressions»
- # «hAWK User’s Manual» «M 5 Built-in string and file functions»
- # «hAWK User’s Manual» «K 4 Built-in variables»
- # «hAWK User’s Manual» «K 8 Arrays»
- # «hAWK User’s Manual» «N User-defined functions»
- # «hAWK User’s Manual» «P 3 The getline function»
- # «hAWK User’s Manual» «O 3 Output into files»
- # «hAWK User’s Manual» «Q The hAWK function»
-
- BEGIN { #Get list of common words to skip.
- commonfile = STDPATH "Drag_on Modules:hAWK programs:" "common words"
- while ((getline < commonfile) > 0) #Parenthesized: "getline < file > 0" is ambiguous.
- {
- for ( k = 1; k <= NF; k++)
- common[$k] = 1; #Forces common[$k] to "exist".
- }
- close(commonfile)
- $0 = ""
- ## time_it = 1
- if (time_it == 1)
- print "Starting time", time()
- }
-
-
- { #Remove non-word characters, count words.
- gsub(/[^A-Za-z_0-9$'-]+/, " ")
- #or try gsub(/\W+/, " ") #\W == [^A-Z_a-z0-9]
- for ( k = 1; k <= NF; k++)
- {
- if (length($k) > 1 && !($k in common)) #Skip 1-character and common words.
- count[$k]++;
- }
- }
- END { #Sort associative array, and print words with count.
- m = sort(count, ind, "rn")
- for (j = 1; j <= m; ++j)
- print ind[j], "\t\t", count[ind[j]]
- if (time_it == 1)
- print "Finishing time", time()
- }
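-
- #A hedged sketch, not part of this program: the END action above
- #relies on sort() as hAWK provides it. In an AWK lacking that
- #function, and assuming a Unix-style "sort" utility is available,
- #a similar listing (count first, then word) could be produced by
- #piping the output through sort instead:
- #    END { for (w in count) print count[w] "\t" w | "sort -rn"
- #          close("sort -rn") }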
-
-